Predict demand for an online classified ad – Avito Demand Prediction

Table of Contents

1. Introduction

  1.1. Problem Statement
  1.2. Dataset Overview
  1.3. Real-World/Business Objective and Constraints
  1.4. Evaluation Metrics

2. Dataset Details

3. Exploratory Data Analysis
    3.1. Checking for NULL Values
    3.2. Target Variable: deal_probability
    3.3. Categorical Features
      3.3.1. item_id & user_id
      3.3.2. region & city
      3.3.3. parent_category_name & category_name
      3.3.4. param_1 & param_2 & param_3
      3.3.5. user_type
    3.4. Numerical Features
      3.4.1. price
      3.4.2. item_seq_number
      3.4.3. image_top_1
    3.5. Text Features
      3.5.1. title
      3.5.2. description
    3.6. Image Features

4. Feature Engineering
    4.1. Fill Missing Values
      4.1.1. Categorical and Text Features
      4.1.2. Numerical Features
      4.1.3. Image Feature
    4.2. Interactive Features
    4.3. More Features
    4.4. Final Feature Engineering
      4.4.1. Categorical Features
      4.4.2. Numerical Features
      4.4.3. Text Features
      4.4.4. Image Features
    4.5. Feature Engineering Pipeline

5. Machine Learning (Neural Network Architecture)
    Please follow the topic-specific notebook in the same directory.

6. Summary

7. Results and Conclusion

1. Introduction

Avito, Russia’s largest classified advertisements website, is deeply familiar with this problem. Sellers on their platform sometimes feel frustrated by either too little demand (indicating something is wrong with the product or the product listing) or too much demand (indicating a hot item with a good description was underpriced).

1.1. Problem Statement

In their Kaggle competition, Avito challenged participants to predict demand for an online advertisement based on its full description (title, description, images, etc.), its context (the geography where it was posted, similar ads already posted) and historical demand for similar ads in similar contexts. With this information, Avito can tell sellers how to best optimize their listing and provide some indication of how much interest they should realistically expect to receive.

When selling used goods online, a combination of tiny, nuanced details in a product description can make a big difference in drumming up interest.

Details like:

[Image: examples of small listing details that drive buyer interest]

And even with an optimized product listing, demand for a product may simply not exist, frustrating sellers who may have over-invested in marketing.

1.2. Dataset Overview

The data is provided openly by Avito through the Kaggle competition. The total size of the provided files is 146.76 GB.

Target Variable: deal_probability. Intuitively, a deal should be binary: Sold or Not Sold. But it is not! The values are floats between 0 and 1, so we are in effect predicting the output of Avito's own model. This is discussed in later sections.

1.3. Real-world/Business Objective and Constraints

This model will help the stakeholder/ad publisher post the ad in a way that maximizes the probability of a successful deal. With the score, the publisher can see which part of their ad is weak: price, geographic location, description, image, etc.

Latency constraint: normal to very slow (no strict low-latency requirement).

Performance metric: RMSE, as low as possible.

1.4. Evaluation Metrics

Root Mean Squared Error (RMSE). Submissions are scored on the root mean squared error, defined as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}$$

where $\hat{y}_i$ is the predicted value and $y_i$ is the original value.
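A minimal NumPy sketch of the metric (the function name is my own):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between predictions and targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))
```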

2. Dataset Details


train.csv – Primary training data.
  Information provided (fields shared with test.csv):
    o item_id - Ad id.
    o user_id - User id.
    o region - Ad region.
    o city - Ad city.
    o parent_category_name - Top level ad category as classified by Avito's ad model.
    o category_name - Fine grain ad category as classified by Avito's ad model.
    o param_1 - Optional parameter from Avito's ad model.
    o param_2 - Optional parameter from Avito's ad model.
    o param_3 - Optional parameter from Avito's ad model.
    o title - Ad title.
    o description - Ad description.
    o price - Ad price.
    o item_seq_number - Ad sequential number for user.
    o activation_date- Date ad was placed.
    o user_type - User type.
    o image - Id code of image. Ties to a jpg file in train_jpg. Not every ad has an image.
    o image_top_1 - Avito's classification code for the image.
    o deal_probability - The target variable. This is the likelihood that an ad actually sold something. It's not possible to verify every transaction with certainty, so this column's value can be any float from zero to one.

test.csv - Test data. Same schema as the train data, minus deal_probability.

train_active.csv - Supplemental data from ads that were displayed during the same period as train.csv. Same schema as the train data minus deal_probability, image, and image_top_1.

test_active.csv - Supplemental data from ads that were displayed during the same period as test.csv. Same schema as the train data minus deal_probability, image, and image_top_1.

periods_train.csv - Supplemental data showing the dates when the ads from train_active.csv were activated and when they were displayed.
    o item_id - Ad id. Maps to an id in train_active.csv. IDs may show up multiple times in this file if the ad was renewed.
    o activation_date - Date the ad was placed.
    o date_from - First day the ad was displayed.
    o date_to - Last day the ad was displayed.

periods_test.csv - Supplemental data showing the dates when the ads from test_active.csv were activated and when they were displayed. Same schema as periods_train.csv, except that the item ids map to an ad in test_active.csv.

train_jpg.zip - Images from the ads in train.csv.

test_jpg.zip - Images from the ads in test.csv.

3. Exploratory Data Analysis

3.1. Checking for Null Values

Quick Observation:

  1. A small percentage of null values is observed in param_1, description, price, image, and image_top_1.
    However, no null values are found in the description column of the test data.
  2. A large percentage of null values is observed in param_2 and param_3.
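The null check above can be reproduced with a one-liner; the toy frame below stands in for train.csv (column names follow the schema in section 2):

```python
import pandas as pd

# Toy frame standing in for train.csv; column names match the real schema.
train = pd.DataFrame({
    "param_1": ["a", None, "b"],
    "param_2": [None, None, "x"],
    "price": [100.0, None, 250.0],
})

# Percentage of missing values per column, worst first.
null_pct = train.isnull().mean().mul(100).sort_values(ascending=False)
```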

3.2. Target Variable : deal_probability

Quick Observation

  1. 64% of data points have a deal_probability of 0, signifying that no deal was made for most of the posted ads.
  2. There are no data points with deal_probability between 0.88059 and 1.

3.3. Categorical Features

3.3.1. item_id and user_id

Quick Observation

  1. The cardinality of the item_id column is very high, so we will not use it in our model.
  2. The cardinality of the user_id column is also very high, so it cannot be used directly in the model. We can instead extract features answering the questions below.
  3. 67,929 users are common to the train and test data.

https://www.kaggle.com/competitions/avito-demand-prediction/discussion/56117

  1. How many items has this user posted?
  2. How frequently does this user post?
  3. How long has it been since this user last posted?
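The questions above can be answered with user-level aggregates; a hedged pandas sketch on toy data (the real pipeline may compute these differently):

```python
import pandas as pd

# Toy ads table; user_id and activation_date follow the train.csv schema.
ads = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "activation_date": pd.to_datetime(["2017-03-15", "2017-03-20", "2017-03-18"]),
})

grp = ads.groupby("user_id")["activation_date"]
user_feats = pd.DataFrame({
    "n_items": grp.size(),                           # how many items has this user posted?
    "days_active": (grp.max() - grp.min()).dt.days,  # posting span, a proxy for frequency
    "last_post": grp.max(),                          # basis for "time since last posting"
})
```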

3.3.2. region and city

3.3.3. parent_category_name & category_name

3.3.4. user_type

Quick Observation:

  1. The majority of ads, 71.55%, are posted by 'Private' users. Various 'Company' users have posted 23.09% of ads, while a small percentage of ads, 5.32%, come from 'Shop' users.

3.4. Numerical Features

3.4.1. price

Some categories have a higher average price than others.
Problem: for example, the average price of cars may bias the average price of pens, so if we normalize the price column globally, very high-priced items such as property or vehicles would push the normalized price of a book to nearly zero.
Solution: we will handle the data points belonging to each category separately.

Quick Observation:

  1. Items with low deal probability have a lower log price, while it is slightly higher for items with deal probability around 0.5.
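The per-category handling described above can be sketched with groupby/transform (toy data; column names follow the schema):

```python
import pandas as pd

# Toy data: prices live on very different scales per category.
df = pd.DataFrame({
    "category_name": ["cars", "cars", "books", "books"],
    "price": [500000.0, 700000.0, 5.0, 15.0],
})

# Standardize price within each category, so car prices cannot flatten
# the normalized prices of cheap items like books.
grp = df.groupby("category_name")["price"]
df["price_norm"] = (df["price"] - grp.transform("mean")) / grp.transform("std")
```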

3.4.2. item_seq_number

3.4.3. image_top_1

Quick observation:

  1. All the numeric features have a wide spectrum of values across categories.

3.5. Text Features

3.5.1. title

3.5.2. description

3.6. Image Features

Quick Observation:

  1. The total number of files in the train image directory is 1,390,836, i.e. about 92.5% of training ads have an image.
  2. The total number of files in the test image directory is 465,829, i.e. about 91.6% of test ads have an image.

Observation

  1. Approximately 88% of the training data has deal probability less than 0.5; the remaining 12% has probability greater than or equal to 0.5.
  2. Top 5 Ad titles are:
    Платье (Dress)
    Туфли (Shoes)
    Куртка (Jacket)
    Пальто (Coat)
    Джинсы (Jeans)
  3. Top 5 Ad cities : Краснодар (Krasnodar)
    Екатеринбург (Yekaterinburg)
    Новосибирск (Novosibirsk)
    Ростов-на-Дону (Rostov-on-don)
    Нижний Новгород (Nizhny Novgorod)
  4. Top 5 Ad regions : Krasnodar Krai
    Sverdlovsk oblast
    Rostov oblast
    Tatarstan
    Chelyabinsk oblast
  5. Top 5 fine-grain ad categories as classified by Avito's ad model: Clothing, shoes and accessories
    Children clothing and shoes
    Childrens product and toys
    Apartments
    Phones
  6. Top 5 top-level ad categories as classified by Avito's ad model: Personal belongings - 46 %
    For the home and garden - 12 %
    Consumer electronics - 12 %
    Real estate - 10 %
    Hobbies & leisure - 6 %
  7. Distribution of user types: Private users constitute 71.6 % of the data,
    Company users constitute 23.1 %, and
    Shop users constitute 5.35 %.
  8. More observations can be found in the individual sections.

4. Feature Engineering

We will treat image_top_1 as both a categorical and a numerical feature, because in the Kaggle competition some teams used it as a categorical feature and some as a numerical one.

4.1. Fill Missing Values

As discussed in subsection 3.1., the train data has null values in param_1, param_2, param_3, description, price, image and image_top_1 (image_top_1_C & image_top_1_N).

4.1.1. Categorical and Text Features

For categorical and text features, we replace missing values with the string 'null' rather than an empty string, so that missingness can be treated as a category of its own.
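A minimal pandas sketch of this step (toy frame; the real column list is longer):

```python
import pandas as pd

# Toy frame; the real pipeline applies this to all categorical/text columns.
df = pd.DataFrame({"param_2": ["x", None], "description": [None, "sofa"]})

# Replace missing values with the literal string 'null', which downstream
# encoders then treat as a category of its own.
cat_text_cols = ["param_2", "description"]
df[cat_text_cols] = df[cat_text_cols].fillna("null")
```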

4.1.2. Numerical Features

price and image_top_1_N have missing values, so rather than using the mean over all ads we follow the approach of filling them with the mean of the ad's category_name.
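A sketch of the category-mean imputation (toy data):

```python
import pandas as pd

df = pd.DataFrame({
    "category_name": ["phones", "phones", "books"],
    "price": [100.0, None, 20.0],
})

# Impute a missing price with the mean price of the ad's own category_name,
# not the global mean, so cheap and expensive categories stay comparable.
df["price"] = df["price"].fillna(
    df.groupby("category_name")["price"].transform("mean")
)
```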

4.1.3. Image Feature

We have replaced unavailable images with a default/dummy image.

4.2. Interactive Features

We have curated five new categorical features by combining multiple existing features.
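The text does not list the exact five combinations, so the pairs below are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["Krasnodar Krai"],
    "city": ["Krasnodar"],
    "parent_category_name": ["Personal belongings"],
    "category_name": ["Clothing"],
})

# Concatenate pairs of categorical columns into new interaction categories.
df["region_city"] = df["region"] + "_" + df["city"]
df["parent_cat_cat"] = df["parent_category_name"] + "_" + df["category_name"]
```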

4.3 More Features

Based on the text features, we create new numerical features.
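Typical examples of such text-derived features are character lengths and word counts; a sketch (the feature names are my own):

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["Платье новое"],
    "description": ["Красивое платье, размер 42"],
})

# Character length and word count of each text column as numeric features.
for col in ["title", "description"]:
    df[col + "_len"] = df[col].str.len()
    df[col + "_words"] = df[col].str.split().str.len()
```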

4.4. Final Feature Engineering

4.4.1. Categorical Features

Since machine learning models cannot consume raw categories, we have encoded the categorical features.
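One common encoding, sketched with pandas (the project may use a different encoder):

```python
import pandas as pd

df = pd.DataFrame({"user_type": ["Private", "Company", "Private", "Shop"]})

# Map each category to an integer id (equivalent to sklearn's LabelEncoder);
# these ids can then feed an embedding layer in the neural network.
codes, uniques = pd.factorize(df["user_type"])
df["user_type_enc"] = codes
```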

4.4.2. Numerical Features

In this subsection we transform the numerical features, applying a logarithmic transformation to avoid the bias caused by plain normalization, followed by a batch-normalization layer in the network.
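A sketch of the log step (the batch-normalization layer itself lives inside the network):

```python
import numpy as np

prices = np.array([0.0, 9.0, 99.0, 999999.0])

# log1p compresses the heavy right tail so huge prices no longer dominate;
# a BatchNormalization layer then standardizes the result inside the network.
log_prices = np.log1p(prices)
```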

4.4.3. Text Features

We have created the embedding matrix using fastText embedding vectors. We cleaned the text of title and description, then encoded the text into integer sequences of length 7 for title and 200 for description.
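A minimal pure-Python stand-in for the tokenize-and-pad step (the project presumably uses the Keras tokenizer; `encode_texts` is my own helper):

```python
# Minimal stand-in for the Keras Tokenizer + pad_sequences step: map each word
# to an integer id, then left-pad/truncate every sequence to a fixed length
# (7 for title, 200 for description in this project).
def encode_texts(texts, maxlen):
    vocab = {}
    encoded = []
    for text in texts:
        ids = []
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1  # id 0 is reserved for padding
            ids.append(vocab[word])
        ids = ids[:maxlen]
        encoded.append([0] * (maxlen - len(ids)) + ids)
    return encoded, vocab

titles, vocab = encode_texts(["новое платье", "платье"], maxlen=7)
```

The embedding matrix is then built by looking up each vocabulary word in the fastText vectors.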

4.4.4. Image Features

We have a very large amount of data and are training on a sample of it. Reading all the images and storing them in a NumPy array would take hundreds of GB, so we pass the images through a VGG16 model without its top layers and store the 512 extracted features.

4.5. Feature Engineering Pipeline

5. Machine Learning

Please follow topic specific notebook in same directory.

6. Summary

This project is quite interesting: we have all types of data to solve one problem.

  1. Categorical Features
  2. Numerical Features
  3. Text Features
  4. Image Features

To summarise, we have built four deep learning models. Starting with just categorical data resulted in an RMSE of 0.2301.
Then we added numerical data to build a model on both types of data, which improved our RMSE by 0.0017 to 0.2284.
For the third model we added the text data, which gave the model a boost of another 0.004, bringing the RMSE to 0.2244.
For the fourth model we brought in the image data as well, which did not go as expected and gave an RMSE of 0.239, because of improper hyperparameter tuning, training on a sample of the data to avoid memory errors, and using VGG16 for feature extraction rather than ResNet as the top competition team did.

For comparison with classical ML models, we have also built an XGBoost model, giving an RMSE of 0.2313.

7. Results and Conclusion

In this case study we have all types of features: categorical, numerical, text and image. In the successive models we added one more kind of data at a time, training on the complete set of datapoints; for the final model, however, we used a sample of datapoints (because of the huge training time) but with all kinds of data. More details are in the models notebook.

We have summarized the results in the table below. In the ideal case, the model trained with categorical, numerical, text and image features would be the best; but in our experiments, adding the image features did not improve the RMSE, for the reasons mentioned above.

    #  Model (features)                          RMSE score
    0  Mean Model                                0.2600
    1  XGBoost                                   0.2313
    2  Embedding (Categorical Features)          0.2301
    3  Embedding (CF) + Numerical Features       0.2284
    4  CF + NF + FastText/LSTM (Text Features)   0.2244
    5  CF + NF + TF + CNN (Image Features)       0.2390

Future Scope

  1. Due to limited domain knowledge and time, I have done only limited feature engineering for the image features. Features like dullness, blurriness, size, quality, etc. could also be included for better performance.
  2. Better hyperparameters could be tuned for each model; I have kept the hyperparameters almost the same across all the models.
  3. For the final model, we have trained on a sample of the data for 20 epochs due to memory and computational limitations; more training could produce better results.